Topic Modelling & Image Classification
Importing libraries
import numpy as np
import pandas as pd
import random
from sys import getsizeof
import gql
import os
import matplotlib.pyplot as plt
import seaborn as sns
import shutil
import warnings
warnings.filterwarnings('ignore')
Using the Yelp GraphQL API to load three reviews and one photo for each of 50 restaurants across 20 different locations:
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
import json
locations = ['San Francisco', 'New York', 'Seattle', 'Philadelphia', 'Houston', 'Chicago', 'Denver', 'San Diego', 'Phoenix', 'San Antonio',
'Nashville', 'Los Angeles', 'San Jose', 'Indianapolis', 'Fort Worth', 'Oklahoma City', 'Miami', 'Boston', 'Austin', 'Portland']
header = {'Authorization': 'bearer {}'.format("fwOPAjnuCwq93veb-5IBlBiW14fiAQODhmhInODSkfVoj7m1VcWYWFgGC-u5v0og_IA7gESAq-hIcr3MT9TIXyCUPv99I9BgkHrwOaF0uAT3FPmqB1H0pdtGmPrkY3Yx"),
'Content-Type':"application/json"}
# Select your transport with a defined url endpoint
transport = AIOHTTPTransport(url='https://api.yelp.com/v3/graphql', headers=header)
# Create a GraphQL client using the defined transport
client = Client(transport=transport, fetch_schema_from_transport=True)
if os.path.isdir("data"):
    shutil.rmtree("data")
os.mkdir("data")
listresultJson = []
# Provide a GraphQL query
# Execute the query on the transport
for index, l in enumerate(locations):
    result = await client.execute_async(gql(
'''{search(location: "'''+l+'''", limit:50) {
business {
name
photos
price
review_count
reviews {
text
rating
time_created
}
location {
city
state
postal_code
country
}
categories {
alias
parent_categories {
alias
}
}
}
}
}
'''
    ))
    listresultJson.append(result['search']['business'])
# extracting the fetched data
cList = []
for l in listresultJson:
    for x in l:
        cList.append(x)
with open('data/data.json', 'w') as output_filec:
    json.dump(cList, output_filec)
with open('data/data.json') as sam:
    d = json.load(sam)
# Converting the data into a pandas dataframe and saving the csv file
data = pd.json_normalize(d)
data.to_csv("data/data.csv", sep='\t')
data
| name | photos | price | review_count | reviews | categories | location.city | location.state | location.postal_code | location.country | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fog Harbor Fish House | [https://s3-media2.fl.yelpcdn.com/bphoto/by8Hh... | $$ | 9992 | [{'text': 'Enjoyed celebrating my bday with my... | [{'alias': 'seafood', 'parent_categories': [{'... | San Francisco | CA | 94133 | US |
| 1 | House of Prime Rib | [https://s3-media4.fl.yelpcdn.com/bphoto/HLrja... | $$$ | 8875 | [{'text': 'Never disappoint! Great food and ve... | [{'alias': 'tradamerican', 'parent_categories'... | San Francisco | CA | 94109 | US |
| 2 | Kokkari Estiatorio | [https://s3-media2.fl.yelpcdn.com/bphoto/FTQfP... | $$$ | 5188 | [{'text': 'Exceptional food all around, from t... | [{'alias': 'greek', 'parent_categories': [{'al... | San Francisco | CA | 94111 | US |
| 3 | Marufuku Ramen | [https://s3-media4.fl.yelpcdn.com/bphoto/ouK2V... | $$ | 4919 | [{'text': 'Very nice restaurant. Good ambiance... | [{'alias': 'ramen', 'parent_categories': [{'al... | San Francisco | CA | 94115 | US |
| 4 | Gary Danko | [https://s3-media1.fl.yelpcdn.com/bphoto/Rqsfo... | $$$$ | 5927 | [{'text': 'Gary Danko is our favorite SF resta... | [{'alias': 'newamerican', 'parent_categories':... | San Francisco | CA | 94109 | US |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Pip's Original Doughnuts & Chai | [https://s3-media2.fl.yelpcdn.com/bphoto/vZljJ... | $ | 3070 | [{'text': 'There was quite the line on a Satur... | [{'alias': 'coffee', 'parent_categories': [{'a... | Portland | OR | 97213 | US |
| 996 | Ava Gene's | [https://s3-media3.fl.yelpcdn.com/bphoto/sckkK... | $$$ | 752 | [{'text': 'I can't say enough about what an in... | [{'alias': 'newamerican', 'parent_categories':... | Portland | OR | 97202 | US |
| 997 | Farmhouse Kitchen Thai Cuisine | [https://s3-media2.fl.yelpcdn.com/bphoto/egThi... | $$ | 557 | [{'text': 'You got a craving for beef noodle s... | [{'alias': 'thai', 'parent_categories': [{'ali... | Portland | OR | 97209 | US |
| 998 | Gilda's Italian Restaurant | [https://s3-media2.fl.yelpcdn.com/bphoto/QL2FW... | $$ | 618 | [{'text': 'I had almost completely forgotten ... | [{'alias': 'italian', 'parent_categories': [{'... | Portland | OR | 97205 | US |
| 999 | Bluefin Tuna & Sushi | [https://s3-media4.fl.yelpcdn.com/bphoto/SkMBs... | $$$ | 259 | [{'text': 'I knew I wanted to eat sushi in Por... | [{'alias': 'sushi', 'parent_categories': [{'al... | Portland | OR | 97232 | US |
1000 rows × 10 columns
Using the JSON reviews data provided by Yelp for sentiment analysis, taking 5000 random reviews from the dataset.
# json file for reviews
reviewFile = "/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json"
# assigning data types to the feature for memory optimization
features = {
"review_id": str,
"user_id": str,
"business_id": str,
"stars": 'int8',
"useful": 'int8',
"funny": 'int8',
"cool": 'int8',
"text": str,
"date": "datetime64[ns]",
}
chunks = [] # Initialize an empty list to store chunks
with pd.read_json(reviewFile, dtype=features, chunksize=100000, lines=True) as jsonReader:
    for chunk in jsonReader:
        chunks.append(chunk)  # Append each chunk to the list
reviewData = pd.concat(chunks, ignore_index=True)  # Concatenate all chunks into a single DataFrame
reviewData = reviewData.sample(5000, random_state=42)
reviewData
| review_id | user_id | business_id | stars | useful | funny | cool | text | date | |
|---|---|---|---|---|---|---|---|---|---|
| 1295256 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 |
| 3297618 | HlXP79ecTquSVXmjM10QxQ | bAt9OUFX9ZRgGLCXG22UmA | pBNucviUkNsiqhJv5IFpjg | 5 | 0 | 0 | 0 | I needed a new tires for my wife's car. They h... | 2020-05-24 12:22:14 |
| 1217795 | JBBULrjyGx6vHto2osk_CQ | NRHPcLq2vGWqgqwVugSgnQ | 8sf9kv6O4GgEb0j1o22N1g | 5 | 0 | 0 | 0 | Jim Woltman who works at Goleta Honda is 5 sta... | 2019-02-14 03:47:48 |
| 3730348 | U9-43s8YUl6GWBFCpxUGEw | PAxc0qpqt5c2kA0rjDFFAg | XwepyB7KjJ-XGJf0vKc6Vg | 4 | 0 | 0 | 0 | Been here a few times to get some shrimp. The... | 2013-04-27 01:55:49 |
| 1826590 | 8T8EGa_4Cj12M6w8vRgUsQ | BqPR1Dp5Rb_QYs9_fz9RiA | prm5wvpp0OHJBlrvTj9uOg | 5 | 0 | 0 | 0 | This is one fantastic place to eat whether you... | 2019-05-15 18:29:25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5884448 | bXXRzBg7DuGnY8ij4INBWg | 9fP3KiiVpFVYcnqgD3aZJw | iaBU5h_j0TCrUFzTbjFIlw | 3 | 9 | 0 | 0 | I am not sure what to think of this place. I b... | 2013-04-09 22:29:48 |
| 6745875 | FkekUQC8z63ywSFQnK4Z4w | JLW2uULP_Q1KXHhToNljcQ | jMStvE-tQzSpRCAO0nAE6g | 3 | 5 | 2 | 8 | I'm so excited to see the red Robin had re-ope... | 2018-09-27 23:47:13 |
| 5730804 | 4IzbwfjgwUq1gXKA97Erwg | lESGYBwhs9ZtpWeJf_2Zig | hGCETx03FN8Qtx1T8StHaQ | 5 | 0 | 0 | 0 | This is our go-to pizza place! We love their ... | 2018-09-05 23:00:37 |
| 1995249 | 23xRe5HtAsPlHyUuM7AFTQ | 5pgl40PSrB-dTbEg-eWIFA | ecapYwbEvmKHKAfsGA4tow | 4 | 3 | 0 | 0 | This is located in a great spot fairly close t... | 2014-02-13 22:54:43 |
| 6544963 | vLxH2ifmZw8htzm_WZCGVw | W0DJOPsSwcAj0uqCJG8iLw | aGOXuqO6yhN66tLYI61Thg | 2 | 1 | 0 | 0 | I went in for a sirloin burger and a salad. Th... | 2015-05-08 02:42:30 |
5000 rows × 9 columns
reviewData.isnull().any()
review_id False user_id False business_id False stars False useful False funny False cool False text False date False dtype: bool
from wordcloud import WordCloud
plt.figure(figsize=(20,10))
# Creating the text variable from the full review texts
textWordCloudBefore = " ".join(reviewData.text)
# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(collocations = False, background_color = 'white', width=2000, height=1000).generate(textWordCloudBefore)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Next, normalise the text: lower-case capital letters and strip punctuation — full stops, question marks, commas, colons and semi-colons, exclamation marks and quotation marks.
import string
string.punctuation
processedReviewData = reviewData.copy()
processedReviewData.reset_index(drop=True, inplace=True)
# defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    punctuationfree = "".join([i for i in punctuationfree if i not in ['\n', '\t', '\b']])
    return punctuationfree
# storing the punctuation-free text
processedReviewData['text_punct_reml'] = processedReviewData['text'].apply(remove_punctuation)
processedReviewData['text_punct_reml']
0 Went for lunch and found that my burger was me...
1 I needed a new tires for my wifes car They had...
2 Jim Woltman who works at Goleta Honda is 5 sta...
3 Been here a few times to get some shrimp They...
4 This is one fantastic place to eat whether you...
...
4995 I am not sure what to think of this place I bo...
4996 Im so excited to see the red Robin had reopene...
4997 This is our goto pizza place We love their cr...
4998 This is located in a great spot fairly close t...
4999 I went in for a sirloin burger and a salad The...
Name: text_punct_reml, Length: 5000, dtype: object
processedReviewData['text_lower']= processedReviewData['text_punct_reml'].apply(lambda x: x.lower())
Using the langdetect library to identify reviews written in languages other than English so those rows can be removed. langdetect compares character n-gram frequencies in the text against per-language profiles and returns the most probable language for each review.
pip install langdetect
Collecting langdetect
Downloading langdetect-1.0.9.tar.gz (981 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: six in /opt/conda/lib/python3.10/site-packages (from langdetect) (1.16.0)
Building wheels for collected packages: langdetect
Building wheel for langdetect (setup.py) ... done
Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993224 sha256=53b09ceee945e52ed79afc8d745d19bfb2b9b761188cf15df169afd7dc67090d
Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Note: you may need to restart the kernel to use updated packages.
import langdetect
languages_langdetect = []
# try/except because detection can fail on some reviews (e.g. ones containing only links or numbers)
for line in processedReviewData['text_lower']:
    try:
        result = langdetect.detect_langs(line)
        result = str(result[0])[:2]
    except:
        result = 'unknown'
    finally:
        languages_langdetect.append(result)
processedReviewData['languages']=languages_langdetect
processedReviewData['languages'].unique()
array(['en', 'es'], dtype=object)
for l in processedReviewData['languages'].unique():
    if l != 'en':
        print(processedReviewData[(processedReviewData['languages'] == l)].text)
3543 El po boy estaba bueno. No probé más nada pero... Name: text, dtype: object
Dropping the rows whose language is not English.
for l in processedReviewData['languages'].unique():
    if l != 'en':
        processedReviewData.drop(processedReviewData[(processedReviewData['languages'] == l)].index, axis=0, inplace=True)
processedReviewData.reset_index(inplace=True)
Stopwords are English words which do not add much meaning to a sentence; they can safely be removed without sacrificing the meaning of the sentence.
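As a minimal illustration of the idea (using a tiny hard-coded stopword set rather than the full NLTK and gensim lists used below), stopword removal is just a membership filter over the tokens:

```python
# Minimal sketch of stopword removal; the small STOP set here is
# illustrative only — the actual pipeline uses NLTK/gensim stopword lists.
STOP = {"the", "a", "an", "and", "is", "was", "to", "for", "of"}

def remove_stopwords(text):
    # Keep only tokens that are not in the stopword set.
    return [w for w in text.lower().split() if w not in STOP]

print(remove_stopwords("The burger was great and the staff is friendly"))
# ['burger', 'great', 'staff', 'friendly']
```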
# Cleaning the texts
import nltk
import re
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS
nltkStopwords = set(stopwords.words('english'))
def cleaningText(text):
    # Clean each sentence and accumulate the surviving tokens
    cleaned = []
    for sentence in nltk.sent_tokenize(text):
        review = re.sub('[^a-zA-Z]', ' ', sentence)
        review = review.split()
        review = [word for word in review if word.lower() not in nltkStopwords]
        review = [word for word in review if word.lower() not in STOPWORDS]
        cleaned.extend(review)
    return ' '.join(cleaned)
processedReviewData['cleanText'] = processedReviewData['text_lower'].apply(cleaningText)
processedReviewData['cleanText']
0 went lunch burger meh obvious focus burgers di...
1 needed new tires wifes car special order day d...
2 jim woltman works goleta honda stars knowledge...
3 times shrimp theyve got nice selection differe...
4 fantastic place eat hungry need good snack goo...
...
4994 sure think place bought groupon year ago arriv...
4995 im excited red robin reopened closer tucson ma...
4996 goto pizza place love crust toppings perfect d...
4997 located great spot fairly close downtown beach...
4998 went sirloin burger salad sirloin burgers got ...
Name: cleanText, Length: 4999, dtype: object
Lemmatization: a technique used to reduce inflected words to their root word — the algorithmic process of identifying an inflected word's "lemma" (dictionary form) based on its intended meaning. \
Tokenization: splitting a text into small units called tokens. \
Neutral words: parts of speech such as nouns, verbs and auxiliaries that do not add to the sentiment of the text.
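A toy sketch of these two steps (the real pipeline below uses spaCy's `en_core_web_lg`, which lemmatizes from vocabulary and POS context; the suffix-stripping here is for illustration only):

```python
import re

# Toy tokenizer: split on runs of non-letter characters.
def tokenize(text):
    return [t for t in re.split(r'[^a-zA-Z]+', text.lower()) if t]

# Toy "lemmatizer": strip a few common suffixes. Real lemmatization
# (spaCy, used below) is dictionary- and context-based, not rule-based.
def naive_lemma(token):
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

print([naive_lemma(t) for t in tokenize("The waiters served amazing burgers!")])
# ['the', 'waiter', 'serv', 'amaz', 'burger']
```

Note the imperfect roots ('serv', 'amaz'): this is exactly why dictionary-based lemmatization is preferred over crude suffix stripping.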
import spacy
from spacy.lang.en import stop_words as spacy_stopwords
stop_words = spacy_stopwords.STOP_WORDS
nlp = spacy.load('en_core_web_lg')
extraStopwords = ['ve', 'll', 'm', 's', 'd', 'ny', 'st', 'woo', 'n', 'ish']
neutralTags = ['NN', 'NNP', 'NNS', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'PRP', 'WP', 'RB', 'RBR', 'RBS', 'IN', 'DT', 'CC']
initialTags = ['ADV', 'NOUN', 'VERB', 'PROPN', 'PRON', 'AUX', 'CCONJ', 'PART', 'SYM', 'SPACE', 'PUNCT', 'DET', 'CONJ', 'X']
# lemmatization
processedReviewData['text_lemmatized']=processedReviewData['cleanText'].apply(lambda x:[token.lemma_ for token in nlp(x) if token.pos_ not in initialTags])
# rechecking for the stopwords
processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])
# rechecking the neutral words
processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in neutralTags])
processedReviewData['text_lemmatized']
0 [obvious, different]
1 [new, special, ready]
2 [knowledgeable, personable, fantastic]
3 [nice, different, great]
4 [fantastic, good, good, good]
...
4994 [brazilian, bad, ineffective, complete, ineffe...
4995 [red, busy, typical, good, open, great, able, ...
4996 [ultimate, busy, extra]
4997 [great, walkable, accessible, nice, expensive,...
4998 [small, live, busy, grand]
Name: text_lemmatized, Length: 4999, dtype: object
For LDA
# lemmatization
initialTagsLDA = ['ADV', 'PRON', 'AUX', 'CCONJ', 'PART', 'SYM', 'SPACE', 'PUNCT', 'DET', 'CONJ', 'X', 'ADJ']
processedReviewData['token_lda']=processedReviewData['cleanText'].apply(lambda x:[token.lemma_ for token in nlp(x) if token.pos_ not in initialTagsLDA])
# rechecking for the stopwords
processedReviewData['token_lda'] = processedReviewData['token_lda'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])
# rechecking the neutral words
# neutralTags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'PRP', 'WP', 'RB', 'RBR', 'RBS', 'IN', 'DT', 'CC']
# processedReviewData['text_lemmatized'] = processedReviewData['text_lemmatized'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in neutralTags])
processedReviewData['token_lda']
0 [lunch, burger, meh, focus, burger, crap, pile...
1 [need, tire, wife, car, order, day, drop, morn...
2 [jim, woltman, work, goleta, honda, star, job,...
3 [time, shrimp, selection, fish, price, seafood...
4 [place, eat, need, snack, price, staff, place,...
...
4994 [think, place, buy, groupon, year, ago, arriva...
4995 [robin, reopen, tucson, mallthis, place, open,...
4996 [pizza, place, love, crust, topping, delivery,...
4997 [locate, spot, downtown, beach, door, wear, sn...
4998 [sirloin, burger, salad, sirloin, burger, chic...
Name: token_lda, Length: 4999, dtype: object
Checking out the words and their frequency
from collections import Counter
def get_all_lemmas(data):
    all_lemmas = []
    for tokens in data:
        all_lemmas.extend(tokens)
    return all_lemmas
all_lemmas = get_all_lemmas(processedReviewData.text_lemmatized)
# Count
lemmas_freq = Counter(all_lemmas)
common_lemmas = lemmas_freq.most_common(100)
print (common_lemmas, len(common_lemmas))
[('good', 2759), ('great', 2101), ('nice', 783), ('little', 602), ('bad', 490), ('new', 481), ('fresh', 445), ('small', 441), ('happy', 357), ('different', 342), ('hot', 330), ('big', 312), ('delicious', 268), ('large', 260), ('old', 259), ('busy', 226), ('special', 215), ('high', 207), ('huge', 195), ('free', 191), ('fantastic', 189), ('local', 189), ('open', 188), ('able', 183), ('extra', 167), ('attentive', 140), ('overall', 140), ('wrong', 139), ('second', 133), ('easy', 125), ('disappointed', 124), ('ready', 122), ('reasonable', 121), ('available', 120), ('short', 120), ('horrible', 116), ('entire', 114), ('terrible', 113), ('professional', 110), ('real', 109), ('hard', 108), ('comfortable', 98), ('regular', 97), ('french', 96), ('low', 91), ('expensive', 89), ('main', 82), ('red', 80), ('authentic', 80), ('italian', 79), ('black', 78), ('live', 77), ('poor', 74), ('knowledgeable', 72), ('outstanding', 72), ('white', 70), ('incredible', 70), ('green', 66), ('average', 65), ('solid', 62), ('chinese', 62), ('soft', 62), ('tiny', 60), ('healthy', 60), ('young', 59), ('true', 58), ('usual', 57), ('single', 55), ('complete', 51), ('personal', 51), ('basic', 50), ('quiet', 49), ('normal', 49), ('casual', 49), ('exceptional', 49), ('safe', 48), ('typical', 48), ('possible', 47), ('fabulous', 47), ('difficult', 47), ('generous', 47), ('satisfied', 47), ('similar', 45), ('traditional', 45), ('original', 44), ('major', 44), ('courteous', 44), ('impressed', 44), ('recent', 43), ('total', 43), ('strong', 43), ('additional', 40), ('vegetarian', 40), ('negative', 40), ('affordable', 40), ('willing', 38), ('classic', 38), ('clear', 37), ('tough', 36), ('positive', 36)] 100
plt.figure(figsize=(20,10))
#Creating the text variable
textWordCloudAfter = " ".join(cat for cat in processedReviewData['text_lemmatized'].apply(lambda review: ' '.join(review)))
# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(collocations = False, background_color = 'white', width=2000, height=1000).generate(textWordCloudAfter)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()
A bigram takes a sentence and gives us sets of two consecutive words in the sentence; a trigram gives sets of three consecutive words. A phrase often carries more meaning than a single word: a two-word phrase is a bigram and a three-word phrase is a trigram.
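The same sliding-window idea in plain Python (`nltk.bigrams` and `nltk.trigrams`, used below, are generator-based equivalents):

```python
# Minimal sketch of n-gram extraction over a token list.
def ngrams(tokens, n):
    # Each window of n consecutive tokens becomes one n-gram string.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['great', 'friendly', 'staff', 'good', 'food']
print(ngrams(tokens, 2))
# ['great friendly', 'friendly staff', 'staff good', 'good food']
print(ngrams(tokens, 3))
# ['great friendly staff', 'friendly staff good', 'staff good food']
```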
processedReviewData['text_lemmatized_ngram']=processedReviewData['cleanText'].apply(lambda x: [token.lemma_ for token in nlp(x) if token.pos_ not in ["NOUN", "PRON", 'PROPN', 'X']])
# rechecking for the stopwords
processedReviewData['text_lemmatized_ngram'] = processedReviewData['text_lemmatized_ngram'].apply(lambda p:[x for x in p if str(x.lower()) not in set(STOPWORDS) and str(x.lower()) not in stop_words and str(x.lower()) not in extraStopwords])
# rechecking the neutral words
processedReviewData['text_lemmatized_ngram'] = processedReviewData['text_lemmatized_ngram'].apply(lambda t: [token for token in t if nltk.pos_tag([token])[0][1] not in ['NN', 'NNP', 'NNS', 'NNPS', 'PRP', 'WP']])
processedReviewData['text_lemmatized_ngram']
0 [obvious, different, appear, preformed, contrary]
1 [new, special, later, ready]
2 [knowledgeable, personable, fantastic]
3 [nice, different, great]
4 [fantastic, good, good, good, friendly]
...
4994 [buy, ago, reluctantly, brazilian, bad, comple...
4995 [excited, red, reopen, close, busy, open, typi...
4996 [ultimate, friendly, busy, occasionally, extra]
4997 [great, fairly, close, walkable, accessible, n...
4998 [come, small, live, away, let, leave, busy, cl...
Name: text_lemmatized_ngram, Length: 4999, dtype: object
# Assuming 'text_lemmatized_ngram' column contains tokenized text
# Function to generate bigrams
def generate_bigrams(token):
    if len(token) >= 2:
        return [' '.join(t) for t in nltk.bigrams(token)]
    else:
        return []
# Function to generate trigrams
def generate_trigrams(token):
    if len(token) >= 3:
        return [' '.join(t) for t in nltk.trigrams(token)]
    else:
        return []
# Apply functions to create bigrams and trigrams
processedReviewData['text_bigrams'] = processedReviewData['text_lemmatized_ngram'].apply(generate_bigrams)
processedReviewData['text_trigrams'] = processedReviewData['text_lemmatized_ngram'].apply(generate_trigrams)
def get_ngrams(data, common):
    lemma_ngram = []
    for tokens in data:
        lemma_ngram.extend(tokens)
    # Count
    lemmas_freq_ngram = Counter(lemma_ngram)
    return lemmas_freq_ngram.most_common(common)
print("Bigrams -- \n", get_ngrams(processedReviewData['text_bigrams'], 100))
print("Trigrams -- \n", get_ngrams(processedReviewData['text_trigrams'], 50))
Bigrams --
[('good good', 243), ('great great', 185), ('good great', 143), ('pretty good', 128), ('good like', 120), ('great good', 113), ('like like', 99), ('good come', 86), ('like good', 74), ('great friendly', 71), ('come come', 69), ('come good', 66), ('great like', 62), ('good friendly', 56), ('definitely come', 55), ('friendly great', 54), ('good definitely', 52), ('come like', 52), ('let know', 51), ('like come', 49), ('great definitely', 49), ('great nice', 48), ('amazing great', 48), ('friendly good', 47), ('good little', 47), ('good amazing', 46), ('like great', 46), ('know good', 46), ('great come', 46), ('probably good', 43), ('good nice', 42), ('great little', 41), ('great amazing', 41), ('nice great', 38), ('good long', 38), ('nice good', 37), ('come great', 36), ('great fresh', 35), ('good small', 34), ('good pretty', 34), ('like know', 33), ('amazing good', 32), ('good fresh', 31), ('great happy', 30), ('know great', 29), ('good leave', 28), ('nice like', 28), ('nice nice', 28), ('fresh good', 28), ('come know', 27), ('fresh great', 27), ('good know', 26), ('bad come', 26), ('know like', 26), ('overall good', 26), ('know know', 26), ('hot good', 26), ('different good', 25), ('like little', 25), ('good bad', 25), ('good highly', 24), ('new new', 24), ('little like', 24), ('like nice', 23), ('like pretty', 23), ('like friendly', 23), ('friendly attentive', 23), ('overall great', 23), ('small good', 22), ('come pretty', 22), ('nice come', 22), ('expect good', 22), ('absolutely delicious', 22), ('know come', 22), ('good large', 22), ('great pretty', 22), ('definitely good', 22), ('come hot', 22), ('nice friendly', 22), ('far good', 22), ('come nice', 22), ('hot hot', 21), ('little great', 21), ('great small', 21), ('good overall', 21), ('long come', 21), ('bad good', 21), ('little good', 20), ('great know', 20), ('good new', 20), ('friendly come', 20), ('long good', 20), ('good different', 19), ('finally come', 19), ('like long', 19), ('great old', 19), ('leave 
like', 19), ('come fresh', 19), ('special good', 18), ('old like', 18)]
Trigrams --
[('great great great', 32), ('good good good', 26), ('good good great', 15), ('like pretty good', 11), ('pretty good good', 11), ('great great good', 11), ('pretty good like', 10), ('good great great', 10), ('pretty good great', 9), ('good good friendly', 8), ('great good good', 8), ('great friendly great', 8), ('good nice good', 7), ('good great good', 7), ('good good like', 7), ('good good definitely', 7), ('actually pretty good', 7), ('like great great', 7), ('great friendly good', 6), ('good good long', 6), ('know great great', 6), ('great good friendly', 6), ('amazing great definitely', 6), ('amazing good great', 6), ('like good good', 6), ('good good come', 6), ('great pretty good', 5), ('good like like', 5), ('leave like come', 5), ('know good good', 5), ('great like good', 5), ('good come good', 5), ('good know good', 5), ('great good great', 5), ('like good like', 5), ('good like pretty', 5), ('good pretty good', 5), ('come good good', 5), ('good like good', 5), ('good amazing good', 5), ('pretty good amazing', 5), ('good good little', 5), ('good friendly good', 5), ('like like like', 5), ('like come like', 5), ('pretty good little', 5), ('good great little', 4), ('come like come', 4), ('good friendly great', 4), ('great fresh great', 4)]
# import and instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# vectorize the lemmatized text
bagWords = cv.fit_transform(processedReviewData['text_lemmatized'].astype(str))
bagWords.shape
(4999, 877)
Tf stands for term frequency, the number of times the word appears in each document.
Idf stands for inverse document frequency, an inverse count of the number of documents a word appears in. Idf measures how significant a word is in the whole corpus.
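A minimal sketch of the tf-idf computation on a toy corpus (scikit-learn's `TfidfVectorizer`, used below, additionally applies idf smoothing and L2 normalisation, so its exact numbers differ):

```python
import math

# Toy corpus: three tokenized "documents".
docs = [['good', 'food', 'good', 'service'],
        ['bad', 'service'],
        ['good', 'price']]

def tf_idf(term, doc, docs):
    # tf: relative frequency of the term within this document.
    tf = doc.count(term) / len(doc)
    # idf: log of the inverse fraction of documents containing the term.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf('good', docs[0], docs))  # common across documents -> lower idf
print(tf_idf('food', docs[0], docs))  # appears in only one document -> higher idf
```

Even though 'good' occurs twice in the first document and 'food' only once, 'food' scores higher because it is rarer across the corpus — which is the point of the idf weighting.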
# import and instantiate the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
# apply the vectorizer to the corpus
idfVector = vectorizer.fit_transform(processedReviewData['text_lemmatized'].astype(str))
# display the document-term matrix
vocab = vectorizer.get_feature_names_out()
print(idfVector.shape)
vocab
(4999, 877)
array(['able', 'academic', 'acceptable', 'accessable', 'accessible',
'accountable', 'acknowledgable', 'acoustic', 'active', 'actual',
'addictive', 'additional', 'adjustable', 'adorable', 'advanced',
'adventuous', 'adventuredrive', 'adventurous', 'aerial',
'aesthetic', 'affected', 'affordable', 'aggressive',
'agricultural', 'alcoholic', 'alive', 'alonecasual',
'alwaysavailable', 'amateurish', 'ambiguous', 'ambitious',
'american', 'americanitalian', 'amicable', 'amish', 'anal',
'angry', 'annual', 'anxious', 'apathetic', 'apocalyptic',
'apologetic', 'appalled', 'applicable', 'appreciative',
'apprehensive', 'approachable', 'appy', 'aquatic', 'architectural',
'argentinian', 'armored', 'arrive', 'arrogant', 'artful',
'artificial', 'artistic', 'asian', 'asthmatic', 'astronomical',
'athenian', 'athletic', 'atrocious', 'attentive', 'attractive',
'audible', 'australian', 'authentic', 'autistic', 'automatic',
'auxiliary', 'available', 'average', 'averagetypical', 'avian',
'aware', 'bad', 'barnesnoble', 'basic', 'belgian', 'best', 'big',
'bilingual', 'billion', 'biodegradable', 'black', 'blasphemous',
'boisterous', 'bombulicious', 'brazilian', 'british', 'broad',
'bureaucratic', 'busy', 'capable', 'casual', 'catastrophic',
'cathedral', 'cautious', 'cdelicious', 'central', 'ceramic',
'certain', 'chaotic', 'charismatic', 'charitable', 'chic',
'chinese', 'chiropractic', 'chronological', 'circuitous',
'citizenmoral', 'civil', 'classic', 'classical', 'clear',
'clinical', 'cobble', 'colombian', 'comfortable', 'comic',
'comical', 'commercial', 'common', 'comparable', 'competitive',
'complete', 'complex', 'composite', 'concerned', 'configurable',
'confortable', 'conscious', 'consecutive', 'conservative',
'considerable', 'conspicuous', 'constant', 'contagious',
'contemporary', 'continued', 'contrarian', 'contrary',
'conventional', 'copious', 'corporate', 'cosmetic', 'costly',
'courteous', 'cous', 'cozyromanticrustic', 'crappy', 'creative',
'criminal', 'critical', 'cultural', 'curious', 'current',
'customary', 'customizable', 'cylindrical', 'daily', 'dangerous',
'dead', 'decipherable', 'deductible', 'deeeeeeelicious',
'defective', 'definitive', 'delectable', 'delicious', 'delighted',
'demographic', 'dependable', 'deplorable', 'desirable', 'detailed',
'diabetic', 'diagnostic', 'dian', 'dietary', 'different',
'difficult', 'direct', 'disabled', 'disappointed', 'disposable',
'dissatisfied', 'distinguishable', 'diuretic', 'doable',
'domestic', 'dooable', 'draconian', 'dramatic', 'drinkable',
'drinkingable', 'drinksmiscellaneous', 'drippy', 'drivable',
'dynamic', 'earthconscious', 'eastern', 'easy', 'eatable',
'eccentric', 'eclectic', 'economical', 'ecstatic', 'ecuadorian',
'edible', 'educational', 'effective', 'egregious', 'electric',
'electrical', 'electronic', 'elementary', 'elusive', 'emblematic',
'emotional', 'empathetic', 'energetic', 'english', 'enjoyable',
'enjoyedive', 'enormous', 'entensive', 'enthusiastic', 'entire',
'environmental', 'equal', 'erratic', 'especial', 'essential',
'eternal', 'ethical', 'ethiopian', 'ethnic', 'ethnicnational',
'ethopian', 'european', 'everpretentious', 'exceptional',
'excessive', 'exclusive', 'exemplary', 'exhaustive', 'existential',
'exotic', 'expansive', 'expensive', 'experienced', 'extensive',
'extra', 'extraordinary', 'fabulous', 'facial', 'factual', 'false',
'familiar', 'famous', 'fanatic', 'fantastic', 'fashionable',
'fastcasual', 'favorable', 'federal', 'festive', 'final',
'financial', 'fixable', 'flat', 'flexible', 'floppy', 'floral',
'focal', 'fondue', 'foodive', 'foolish', 'foreign', 'foreseeable',
'forgettable', 'formal', 'formidable', 'fourth', 'fractional',
'free', 'french', 'fresh', 'functional', 'furious', 'geneous',
'general', 'generous', 'german', 'gigantic', 'ginormous',
'glamorous', 'global', 'gloppy', 'glorious', 'gneeral', 'golden',
'good', 'gorgeous', 'gracious', 'grand', 'graphic', 'great',
'green', 'gross', 'guilty', 'happy', 'hard', 'hawaiian', 'healthy',
'heavy', 'hectic', 'heretypical', 'hermetic', 'hideous', 'high',
'hilarious', 'hippy', 'hispanic', 'historical', 'honorable',
'hoppy', 'horrendous', 'horrible', 'horticultural', 'hospitable',
'hot', 'huge', 'humble', 'humorous', 'hypersexual', 'hypothetical',
'iced', 'identical', 'illegal', 'imaginative', 'immersive',
'impeccable', 'imperial', 'important', 'impossible', 'impractical',
'impressed', 'impressive', 'inattentive', 'incapable', 'inclined',
'inclusive', 'incredible', 'indecisive', 'independent',
'indescribable', 'indian', 'indicative', 'individual',
'industrial', 'inedible', 'ineffective', 'inevitable',
'inexcusable', 'inexpensive', 'inexperienced', 'inexplicable',
'infamous', 'infectious', 'inflexible', 'influential', 'informal',
'informational', 'informative', 'ingenious', 'initial',
'innocuous', 'innovative', 'inoperable', 'institutional',
'instructional', 'instrumental', 'insultingive', 'insurmountable',
'intact', 'intensive', 'intentional', 'interactive', 'interested',
'internal', 'international', 'intrusive', 'iralian', 'irish',
'irreplaceable', 'irritable', 'irritated', 'isolated', 'israeli',
'italian', 'japanese', 'jealous', 'knowledgable', 'knowledgeable',
'lackadaisical', 'large', 'laughable', 'lavish', 'lebanese',
'legal', 'legendary', 'legislative', 'lest', 'liberal', 'likely',
'limited', 'literal', 'little', 'live', 'livemusic',
'livingsocial', 'local', 'longish', 'loose', 'low', 'lucky',
'lunchcomplimentary', 'lunchvegetarian', 'luscious', 'luxurious',
'magical', 'main', 'majestic', 'major', 'manageable', 'manual',
'manyunbelievable', 'married', 'marvelous', 'masochistic',
'massive', 'mechanical', 'medical', 'memorable', 'meticulous',
'mexican', 'microwaveable', 'military', 'million', 'minimalistic',
'miraculous', 'miserable', 'mixed', 'modern', 'modest',
'monotonous', 'monstrous', 'moral', 'moroccan', 'municipal',
'musical', 'mysterious', 'naked', 'nasty', 'national',
'nationwide', 'native', 'natural', 'naturalistic', 'nearest',
'necessary', 'negative', 'neglectful', 'nervous', 'neurological',
'neutral', 'new', 'newagepostmodern', 'nice', 'nitrous', 'noble',
'nomadfive', 'nominal', 'nonalcoholic', 'nonfunctional',
'nonintrusive', 'noninvasive', 'nonrefundable', 'nonreturnable',
'nonvegetarian', 'nonverbal', 'normal', 'northern', 'norwegian',
'notable', 'notary', 'noteworthy', 'noticeable', 'notorious',
'noxious', 'nuclear', 'numerous', 'nutritional', 'oblivious',
'obnoxious', 'obvious', 'occasional', 'offensive', 'ohsocrucial',
'oilbalasmic', 'old', 'olive', 'open', 'opentable', 'operational',
'optimistic', 'optional', 'oral', 'ordinary', 'organic',
'orgasmic', 'oriental', 'original', 'orleanian', 'ostentatious',
'outrageous', 'outstanding', 'overall', 'overdue', 'overfive',
'overwhelmed', 'palatable', 'parisian', 'partial', 'particular',
'passable', 'passible', 'pathetic', 'personable', 'personal',
'peruvian', 'phantasmagorical', 'physical', 'pittsburghian',
'pleased', 'pleasurable', 'polish', 'political', 'poor', 'poppy',
'popular', 'portable', 'portuguese', 'positive', 'possible',
'potatoarugulacheese', 'potential', 'powerful', 'practical',
'precious', 'predictable', 'preliminary', 'prepared',
'pretentious', 'previous', 'private', 'problematic',
'professional', 'prophetic', 'prosthetic', 'provencal',
'puddingfantastic', 'punctual', 'questionable', 'quiet',
'quintessential', 'racial', 'raucous', 'ready', 'real',
'realistic', 'reasonable', 'recent', 'recyclable', 'red',
'refundable', 'regional', 'regrettable', 'regular', 'relatable',
'related', 'reliable', 'religious', 'remarkable', 'reputable',
'residential', 'residual', 'respectable', 'responsible', 'retail',
'reusable', 'revolutionary', 'rican', 'rich', 'rid', 'ridiculous',
'righteous', 'romantic', 'rural', 'russian', 'rustic', 'safe',
'salvadorian', 'sanitary', 'sarcastic', 'satisfied',
'saturdayincredible', 'saucecheese', 'scan', 'scandinavian',
'scary', 'scientific', 'scottish', 'scrumptious', 'seasonal',
'seatingcomplimentary', 'second', 'sectional', 'seinfeldian',
'semiauthentic', 'semiinconspicuous', 'senior', 'sensitive',
'separate', 'serviceable', 'seven', 'severe', 'sexual', 'sharable',
'shareable', 'sharp', 'short', 'shrimpcheckbavarian', 'siamese',
'sichuanese', 'sicilian', 'significant', 'similar', 'simplistic',
'single', 'sizable', 'skeptical', 'sloppy', 'small', 'snappy',
'snippy', 'socal', 'sociable', 'social', 'soft', 'solid',
'sophisticated', 'southern', 'spacious', 'spanish', 'spastic',
'special', 'specialized', 'specific', 'spiritual', 'spontaneous',
'spreadable', 'stable', 'starsmusic', 'stationary', 'steady',
'strategic', 'strenuous', 'strong', 'stupendous', 'stupid',
'substantial', 'successful', 'sudden', 'sugary', 'suitable',
'sumptuous', 'superior', 'supernatural', 'surgical', 'surprised',
'suspicious', 'sustainable', 'swedish', 'swiss', 'symmetrical',
'sympathetic', 'synonymous', 'synthetic', 'taiwanese', 'tastic',
'tawainese', 'technical', 'temporary', 'tenuous', 'terrible',
'textural', 'thailaotian', 'therapeutic', 'thetable', 'thoughtful',
'tiny', 'tolerable', 'topadrian', 'total', 'touchable', 'tough',
'tradional', 'traditional', 'tragicomic', 'transamerican',
'transformational', 'traumatic', 'treacherous', 'tremendous',
'tropical', 'troubled', 'true', 'tuscan', 'typical', 'ukrainian',
'ultimate', 'ultra', 'unable', 'unacceptable', 'unannounced',
'unapologetic', 'unapproachable', 'unattended', 'unattentive',
'unattractive', 'unavailable', 'unbearable', 'unbeatable',
'unbelievable', 'unbiased', 'unborn', 'uncanny', 'unclean',
'unclear', 'unclogged', 'uncomfortable', 'uncommon', 'unconcerned',
'unconventional', 'uncorked', 'undeniable', 'undercooked',
'underdressed', 'underrated', 'undersalted', 'underseasoned',
'understandable', 'undertrained', 'underwhelmed', 'undivided',
'uneasy', 'uneaten', 'uneducated', 'unenthusiastic', 'unethical',
'uneventful', 'unexpected', 'unflavored', 'unfounded', 'unglued',
'ungrateful', 'unhappy', 'unhealthy', 'unheard', 'unhelpful',
'uni', 'unicorn', 'unidentifiable', 'unidentified', 'unimpressed',
'uninformed', 'uninterested', 'unknown', 'unlikely', 'unlimited',
'unlucky', 'unmanned', 'unmarked', 'unmelted', 'unmemorable',
'unmitigated', 'unnecessary', 'unnoticed', 'unobtrusive',
'unorganized', 'unpacked', 'unpalatable', 'unpaved', 'unpleasant',
'unpredictable', 'unprepared', 'unprofessional', 'unproffessional',
'unreasonable', 'unremarkable', 'unresponsive', 'unrivaled',
'unsafe', 'unsanitary', 'unsatisfied', 'unseasoned', 'unseen',
'unserved', 'unsimilar', 'unspectacular', 'unsuccessful',
'unsweetened', 'untoasted', 'untouched', 'untrained', 'untreated',
'untrue', 'unusable', 'unusual', 'unwanted', 'unwarranted',
'unwilling', 'unwrapped', 'uplandwant', 'upper', 'upright',
'upsale', 'upscale', 'uptown', 'urban', 'useable', 'useful',
'usual', 'valid', 'valuable', 'vegetarian', 'venetian',
'veterinarian', 'viable', 'victorian', 'vietnamese', 'viewable',
'vigorous', 'virtual', 'visible', 'walkable', 'wary',
'wastedtypical', 'weak', 'weary', 'weekly', 'western', 'whimsical',
'white', 'whogotwhat', 'wide', 'widespread', 'willing', 'wondrous',
'workpersonal', 'worried', 'wrong', 'young'], dtype=object)
from gensim.models import FastText
model_ted = FastText(processedReviewData['text_lemmatized'], vector_size=500, window=3, min_count=3, workers=4,sg=1)
wordFastText = pd.concat([pd.DataFrame(model_ted.wv.index_to_key, columns=['words']), pd.DataFrame(model_ted.wv.vectors)], axis=1)
wordFastText
| words | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | good | 0.007585 | 0.020563 | 0.031869 | -0.023586 | -0.092026 | 0.118824 | -0.010433 | 0.071385 | 0.048883 | ... | 0.015999 | -0.053557 | 0.086451 | 0.079166 | 0.017622 | -0.039000 | -0.011855 | 0.017838 | 0.021071 | -0.063291 |
| 1 | great | 0.006736 | 0.021745 | 0.031971 | -0.023755 | -0.092718 | 0.120262 | -0.011136 | 0.071410 | 0.048961 | ... | 0.016211 | -0.053955 | 0.087280 | 0.080063 | 0.017959 | -0.039566 | -0.012185 | 0.017989 | 0.021260 | -0.063534 |
| 2 | nice | 0.006962 | 0.020661 | 0.031053 | -0.022866 | -0.092167 | 0.118266 | -0.011579 | 0.070583 | 0.048773 | ... | 0.014773 | -0.052688 | 0.086204 | 0.078155 | 0.017732 | -0.039316 | -0.012118 | 0.017011 | 0.020682 | -0.061572 |
| 3 | little | 0.006871 | 0.021982 | 0.031937 | -0.023956 | -0.094171 | 0.121453 | -0.011230 | 0.072278 | 0.049842 | ... | 0.016465 | -0.054544 | 0.088703 | 0.081260 | 0.018051 | -0.040710 | -0.012799 | 0.018143 | 0.021881 | -0.064991 |
| 4 | bad | 0.006066 | 0.020306 | 0.030078 | -0.023124 | -0.087307 | 0.114183 | -0.011509 | 0.068787 | 0.047072 | ... | 0.015191 | -0.050275 | 0.082482 | 0.074933 | 0.017196 | -0.038018 | -0.011489 | 0.015674 | 0.019779 | -0.060988 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 407 | predictable | 0.005787 | 0.019836 | 0.029363 | -0.022185 | -0.085495 | 0.111149 | -0.010124 | 0.066129 | 0.045813 | ... | 0.015303 | -0.049326 | 0.080736 | 0.073998 | 0.016647 | -0.036347 | -0.011511 | 0.016032 | 0.019480 | -0.058990 |
| 408 | asthmatic | 0.005678 | 0.017467 | 0.026198 | -0.019846 | -0.075831 | 0.098743 | -0.009328 | 0.058476 | 0.040706 | ... | 0.013497 | -0.043997 | 0.071335 | 0.065167 | 0.014708 | -0.032377 | -0.010231 | 0.014353 | 0.017530 | -0.051897 |
| 409 | inevitable | 0.007030 | 0.022028 | 0.032665 | -0.024929 | -0.095467 | 0.124766 | -0.011445 | 0.074270 | 0.051407 | ... | 0.017394 | -0.055469 | 0.090648 | 0.083029 | 0.018738 | -0.040776 | -0.012426 | 0.018116 | 0.021997 | -0.065973 |
| 410 | cautious | 0.006735 | 0.021391 | 0.031001 | -0.023805 | -0.092015 | 0.119459 | -0.010759 | 0.070910 | 0.049286 | ... | 0.016081 | -0.053000 | 0.086676 | 0.079199 | 0.017659 | -0.038975 | -0.012009 | 0.017500 | 0.021157 | -0.063003 |
| 411 | inclusive | 0.005418 | 0.017761 | 0.027105 | -0.020305 | -0.077979 | 0.101299 | -0.009176 | 0.060560 | 0.041862 | ... | 0.013913 | -0.044960 | 0.073105 | 0.066813 | 0.015029 | -0.032807 | -0.010161 | 0.014970 | 0.017935 | -0.053572 |
412 rows × 501 columns
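FastText places semantically related words close together in the embedding space, and similarity between two word vectors is conventionally measured with cosine similarity (this is what `model_ted.wv.similarity` computes internally). A minimal stdlib sketch of the metric, using made-up 3-dimensional vectors in place of the 500-dimensional ones above:

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the embeddings of two similar words
print(cosine_similarity([0.1, 0.9, 0.2], [0.12, 0.88, 0.19]))
```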
from gensim.models.phrases import Phrases, Phraser, ENGLISH_CONNECTOR_WORDS
def bigram2vec(unigrams):
    bigram = Phraser(Phrases(unigrams, min_count=3, connector_words=ENGLISH_CONNECTOR_WORDS))
    trigram = Phraser(Phrases(bigram[unigrams], min_count=1, connector_words=ENGLISH_CONNECTOR_WORDS))
    return FastText(trigram[bigram[unigrams]], min_count=3, vector_size=500)
resBigram = bigram2vec(processedReviewData['text_lemmatized'])
FastTextGram = pd.concat([pd.DataFrame(resBigram.wv.index_to_key, columns=['words']), pd.DataFrame(resBigram.wv.vectors)], axis=1)
FastTextGram
| words | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | good | -0.004363 | 0.059130 | -0.033575 | -0.015365 | -0.055984 | 0.229594 | 0.101116 | 0.099879 | 0.124248 | ... | 0.035882 | -0.106193 | 0.142355 | 0.153548 | 0.023196 | -0.068305 | -0.101732 | -0.012262 | 0.025996 | -0.063845 |
| 1 | great | -0.005083 | 0.055807 | -0.031517 | -0.014320 | -0.051816 | 0.213626 | 0.093615 | 0.092167 | 0.115060 | ... | 0.033472 | -0.099019 | 0.132674 | 0.143456 | 0.021747 | -0.063909 | -0.095276 | -0.011507 | 0.024223 | -0.059054 |
| 2 | nice | -0.004621 | 0.057925 | -0.033440 | -0.014672 | -0.055553 | 0.224275 | 0.097540 | 0.097148 | 0.121094 | ... | 0.033674 | -0.103686 | 0.139565 | 0.149776 | 0.022943 | -0.067398 | -0.099848 | -0.012316 | 0.025370 | -0.060970 |
| 3 | little | -0.005094 | 0.060075 | -0.034196 | -0.015304 | -0.056058 | 0.229196 | 0.100533 | 0.098904 | 0.123695 | ... | 0.035680 | -0.106335 | 0.142592 | 0.153924 | 0.023370 | -0.069184 | -0.102532 | -0.012338 | 0.026522 | -0.063816 |
| 4 | bad | -0.005169 | 0.061367 | -0.034739 | -0.016215 | -0.056921 | 0.235274 | 0.102217 | 0.102554 | 0.127276 | ... | 0.036412 | -0.108640 | 0.146176 | 0.156994 | 0.024103 | -0.070742 | -0.104594 | -0.013407 | 0.026801 | -0.065800 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 410 | unremarkable | -0.003312 | 0.040384 | -0.022834 | -0.010332 | -0.037548 | 0.154411 | 0.068190 | 0.066807 | 0.083270 | ... | 0.024341 | -0.071124 | 0.095956 | 0.103041 | 0.015808 | -0.046174 | -0.068809 | -0.008217 | 0.017747 | -0.042761 |
| 411 | intrusive | -0.003504 | 0.039752 | -0.022503 | -0.010556 | -0.037427 | 0.153134 | 0.067137 | 0.066405 | 0.082567 | ... | 0.023477 | -0.070610 | 0.095049 | 0.102024 | 0.015873 | -0.045570 | -0.068041 | -0.008028 | 0.017643 | -0.042493 |
| 412 | laughable | -0.004042 | 0.049433 | -0.027913 | -0.012833 | -0.046223 | 0.189718 | 0.083576 | 0.082299 | 0.102606 | ... | 0.029824 | -0.087221 | 0.117993 | 0.127240 | 0.019594 | -0.056342 | -0.084293 | -0.010275 | 0.021348 | -0.052545 |
| 413 | social_safe | -0.001579 | 0.021694 | -0.012124 | -0.005535 | -0.019793 | 0.082633 | 0.036100 | 0.035450 | 0.044777 | ... | 0.012776 | -0.038453 | 0.051318 | 0.055270 | 0.008523 | -0.024798 | -0.036753 | -0.004801 | 0.009108 | -0.023116 |
| 414 | avian | -0.003986 | 0.042184 | -0.024468 | -0.010726 | -0.039081 | 0.160163 | 0.070256 | 0.069091 | 0.086504 | ... | 0.024721 | -0.073872 | 0.099148 | 0.107591 | 0.016270 | -0.047616 | -0.071266 | -0.009398 | 0.018163 | -0.044432 |
415 rows × 501 columns
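The `Phrases` model promotes a word pair to a single token (e.g. `social_safe` in the table above) when a co-occurrence score exceeds a threshold (10 by default). A sketch of gensim's default scoring formula, evaluated on made-up counts:

```python
def original_scorer(worda_count, wordb_count, bigram_count, min_count, vocab_size):
    # gensim's default bigram score (after Mikolov et al., 2013):
    # high when the pair co-occurs far more often than its parts
    # would by chance; pairs scoring above the threshold are merged.
    return (bigram_count - min_count) / (worda_count * wordb_count) * vocab_size

# Hypothetical counts: "social" 50x, "safe" 40x, "social safe" 30x, vocab of 1000
print(original_scorer(50, 40, 30, 3, 1000))
```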
Polarity measures the sentiment of a text. Its values lie in [-1, 1], where -1 denotes a highly negative sentiment and 1 a highly positive one.
Subjectivity measures whether a text is factual information or personal opinion. Its values lie in [0, 1], where values close to 0 indicate factual information and values close to 1 indicate personal opinion.
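The binarization applied below (`np.where(text_polarity > 0, 1, 0)`) can be written as a plain function. Note that it labels a polarity of exactly 0, which often corresponds to genuinely neutral text, as negative; that is worth keeping in mind when reading the counts:

```python
def textblob_label(polarity):
    # Mirrors np.where(text_polarity > 0, 1, 0): 1 = positive, 0 = non-positive
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("TextBlob polarity lies in [-1, 1]")
    return 1 if polarity > 0 else 0
```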
from textblob import TextBlob
processedReviewData['text_polarity']= processedReviewData['cleanText'].apply(lambda x: TextBlob(x).sentiment.polarity)
processedReviewData['text_subjectivity']= processedReviewData['cleanText'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
processedReviewData['textBlobSentiments'] = np.where(processedReviewData['text_polarity']>0, 1, 0)
fig = plt.figure(figsize=(20, 5), tight_layout=True)
plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='textBlobSentiments', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x()
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("Textblob sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])
plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='text_polarity', hue='stars', palette="Set1")
plt.title("Textblob sentiment polarity distribution")
plt.xlabel("Polarity")
Text(0.5, 0, 'Polarity')
According to the polarity values, 981 reviews are negative or close to negative.
The compound score is a metric that sums all the lexicon ratings and normalizes the result to [-1, 1], where -1 is the most extreme negative and +1 the most extreme positive. \ Positive sentiment: compound score >= 0.05 \ Neutral sentiment: -0.05 < compound score < 0.05 \ Negative sentiment: compound score <= -0.05
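These cutoffs translate directly into a small three-way classifier (the notebook's binarization below instead uses `compound > 0`, which folds neutral reviews into the negative class):

```python
def vader_label(compound):
    # Standard VADER cutoffs for the normalized compound score
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```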
pip install vaderSentiment
Collecting vaderSentiment
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment = SentimentIntensityAnalyzer()
processedReviewData['vader_polarity'] = processedReviewData['cleanText'].apply(lambda x: sentiment.polarity_scores(x)['compound'])
# Binarize: compound > 0 -> positive (the standard VADER cutoff is +/- 0.05)
processedReviewData['vaderSentiments'] = np.where(processedReviewData['vader_polarity'] > 0, 1, 0)
fig = plt.figure(figsize=(20, 5), tight_layout=True)
plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='vaderSentiments', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.1
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("VADER sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])
plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='vader_polarity', hue='stars', palette="Set1")
plt.title("VADER sentiment polarity distribution")
plt.xlabel("Polarity")
Text(0.5, 0, 'Polarity')
pip install flair
Collecting flair
Downloading flair-0.13.0-py3-none-any.whl (387 kB)
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aiobotocore 2.5.4 requires botocore<1.31.18,>=1.31.17, but you have botocore 1.29.165 which is incompatible.
Successfully installed botocore-1.29.165 bpemb-0.3.4 conllu-4.5.3 flair-0.13.0 ftfy-6.1.3 gdown-4.7.1 pptree-3.1 pytorch-revgrad-0.2.0 segtok-1.5.11 sqlitedict-2.1.0 transformer-smaller-training-vocab-0.3.3 wcwidth-0.2.12 wikipedia-api-0.6.0
Note: you may need to restart the kernel to use updated packages.
from flair.models import TextClassifier
from flair.data import Sentence
sia = TextClassifier.load('en-sentiment')
def flair_prediction(x):
    sentence = Sentence(x)
    sia.predict(sentence)
    return sentence.labels[0].to_dict()
resultsFlair = pd.Series(processedReviewData["cleanText"].apply(flair_prediction))
processedReviewData["flairPolarity"] = resultsFlair.apply(lambda con : con['confidence'])
processedReviewData["flairSentiment"] = resultsFlair.apply(lambda con : 1 if con['value']=="POSITIVE" else 0)
fig = plt.figure(figsize=(20, 5), tight_layout=True)
plt.subplot(1, 2, 1)
textSent = sns.countplot(x='stars', hue='flairSentiment', data=processedReviewData)
for p in textSent.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.1
    txt_y = p.get_height()
    textSent.text(txt_x, txt_y, txt, size=14)
plt.title("Flair sentiment analysis for review")
plt.xlabel("Review stars (ratings)")
plt.ylabel("Number of reviews")
plt.legend(["Negative", "Positive"])
plt.subplot(1, 2, 2)
sns.kdeplot(data=processedReviewData, x='flairPolarity', hue='stars', palette="Set1")
plt.title("Flair sentiment polarity distribution")
plt.xlabel("Polarity")
Text(0.5, 0, 'Polarity')
from sklearn.decomposition import PCA
# Function to check the variance explained by the principal components
def pcaFunction(data, numberOfComponent):
    pca = PCA(n_components=numberOfComponent)
    pca.fit(data)
    scree = pca.explained_variance_ratio_ * 100
    plt.figure(figsize=(7, 5))
    plt.bar(np.arange(len(scree)) + 1, scree)
    plt.plot(np.arange(len(scree)) + 1, scree.cumsum(), c="red", marker='o')
    plt.xlabel("Number of principal components")
    plt.ylabel("Percentage explained variance")
    plt.title("Scree Plot to check variance ratio")
    plt.xticks([1, 2])
    plt.text(1.5, 25, "Variance accumulation with\n2 components : {}%".format((np.cumsum(pca.explained_variance_ratio_)[-1] * 100).round(2)), fontsize=12)
    plt.show(block=False)
    return pca
pcaFunction(wordFastText.iloc[:,1:], 2)
PCA(n_components=2)
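The scree plot is driven by `explained_variance_ratio_`. As a sanity check on what that quantity is, PCA's per-component variances can be recomputed by hand from the singular values of the centered data; the ratios always sum to 1 over all components. A sketch on random data (the shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))             # 60 samples, 5 features
Xc = X - X.mean(axis=0)                  # PCA centers the data first
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
explained_variance = s ** 2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()
print(ratio)  # what sklearn exposes as explained_variance_ratio_
```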
Reducing the dimensionality of the FastText word-embedding data with PCA
def pcaReduction(data):
    pca = PCA(n_components=2)
    return pd.DataFrame(pca.fit_transform(data))
pcaData = pcaReduction(wordFastText.iloc[:,1:])
pcaData['words'] = model_ted.wv.index_to_key
pcaData
| 0 | 1 | words | |
|---|---|---|---|
| 0 | -0.031767 | -0.004287 | good |
| 1 | -0.045535 | -0.004297 | great |
| 2 | -0.017838 | 0.000361 | nice |
| 3 | -0.061231 | -0.000994 | little |
| 4 | 0.037099 | 0.005710 | bad |
| ... | ... | ... | ... |
| 407 | 0.070079 | 0.002232 | predictable |
| 408 | 0.222878 | -0.000449 | asthmatic |
| 409 | -0.093243 | 0.002429 | inevitable |
| 410 | -0.027499 | -0.001423 | cautious |
| 411 | 0.193112 | -0.000559 | inclusive |
412 rows × 3 columns
Reducing the dimensionality of the FastText word-embedding data with t-SNE
from sklearn.manifold import TSNE
def tsneReduction(data):
    return pd.DataFrame(TSNE(n_components=2).fit_transform(data))
tsneData = tsneReduction(wordFastText.iloc[:,1:])
tsneData['words'] = model_ted.wv.index_to_key
tsneData
| 0 | 1 | words | |
|---|---|---|---|
| 0 | 3.192504 | -0.634607 | good |
| 1 | 0.113141 | -1.754497 | great |
| 2 | 5.619693 | -0.744377 | nice |
| 3 | -3.572501 | -1.655346 | little |
| 4 | 13.541284 | 0.572302 | bad |
| ... | ... | ... | ... |
| 407 | 17.648628 | 1.109645 | predictable |
| 408 | 31.769602 | 0.576736 | asthmatic |
| 409 | -10.732373 | -1.950683 | inevitable |
| 410 | 3.960903 | -1.051612 | cautious |
| 411 | 29.460543 | 0.985976 | inclusive |
412 rows × 3 columns
pcaDataGram = pcaReduction(FastTextGram.iloc[:,1:])
pcaDataGram['words'] = FastTextGram['words']
tsneDataGram = tsneReduction(FastTextGram.iloc[:,1:])
tsneDataGram['words'] = FastTextGram['words']
Clustering the reduced data
from sklearn.cluster import KMeans, MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = MiniBatchKMeans()
visualizer = KElbowVisualizer(model, k=(2,12), timings=False)
visualizer.fit(pcaData[[0,1]]) # Fit the data to the visualizer
visualizer.show()
<Axes: title={'center': 'Distortion Score Elbow for MiniBatchKMeans Clustering'}, xlabel='k', ylabel='distortion score'>
According to the distortion scores, K-means with 5 clusters gives good results.
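The elbow plot's distortion score is the within-cluster sum of squared distances to the nearest centroid (K-means' inertia). A minimal stdlib sketch on toy 2-D points, with hypothetical centroids rather than fitted ones:

```python
def distortion(points, centroids):
    # Sum of squared Euclidean distances from each point to its closest centroid
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centroids
        )
    return total

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(distortion(pts, [(0.5, 0.0), (10.5, 0.0)]))  # 1.0: each point is 0.5 away
```

As k grows, distortion only decreases; the "elbow" marks where adding clusters stops paying off.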
def clusteringBow(data, k):
    clusters = KMeans(n_clusters=k)
    clusters.fit(data)
    return clusters.labels_
pcaData['labels'] = clusteringBow(wordFastText.iloc[:,1:], 5)
tsneData['labels'] = clusteringBow(wordFastText.iloc[:,1:], 5)
pcaDataGram['labels'] = clusteringBow(FastTextGram.iloc[:,1:], 5)
tsneDataGram['labels'] = clusteringBow(FastTextGram.iloc[:,1:], 5)
fig, ax = plt.subplots(1, 2, figsize=(20,7), tight_layout=True)
sns.scatterplot(data = pcaData, x=0, y=1, hue="labels", ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of words with PCA')
ax[0].set_xlabel('Principal component 1')
ax[0].set_ylabel('Principal component 2')
sns.scatterplot(data = tsneData, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')
Text(0, 0.5, 'Component 2')
fig, ax = plt.subplots(1, 2, figsize=(20,7), tight_layout=True)
sns.scatterplot(data = pcaDataGram, x=0, y=1, hue="labels", ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of words with PCA')
ax[0].set_xlabel('Principal component 1')
ax[0].set_ylabel('Principal component 2')
sns.scatterplot(data = tsneDataGram, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')
Text(0, 0.5, 'Component 2')
Number of words per cluster
fig= plt.figure(figsize=(10,5), tight_layout=True)
plt.subplot(1, 2, 1)
countUnigram = sns.countplot(data=pcaData, x='labels')
for p in countUnigram.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.3
    txt_y = p.get_height()
    countUnigram.text(txt_x, txt_y, txt, size=14)
plt.ylabel('Number of words')
plt.xlabel('Cluster num')
plt.title('Number of words in each cluster for Unigram')
plt.subplot(1, 2, 2)
countUnigram = sns.countplot(data=pcaDataGram, x='labels')
for p in countUnigram.patches:
    txt = str(p.get_height())
    txt_x = p.get_x() + 0.3
    txt_y = p.get_height()
    countUnigram.text(txt_x, txt_y, txt, size=14)
plt.ylabel('Number of words')
plt.xlabel('Cluster num')
plt.title('Number of words in each cluster for Bi/trigram')
Considering unigrams, cluster 0 has the most words, followed by clusters 3 and 1. Considering bi/trigrams, cluster 3 has the most words, followed by clusters 1 and 4.
fig, axes = plt.subplots(1,5, figsize=(25,10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
cloudClusterWords = " ".join(cat for cat in pcaData[pcaData['labels']==i]['words'])
plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', width=5000 ,height=7000, colormap='tab20').generate(cloudClusterWords))
plt.gca().set_title('Topic ' + str(i))
plt.gca().axis('off')
For unigrams, topics 1 and 3 contain pessimistic words, which may relate to service, ambiance, food, and restaurant location.
fig, axes = plt.subplots(1,5, figsize=(25,10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
cloudClusterWords = " ".join(cat for cat in pcaDataGram[pcaDataGram['labels']==i]['words'])
plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', width=5000 ,height=7000, colormap='tab20').generate(cloudClusterWords))
plt.gca().set_title('Topic ' + str(i))
plt.gca().axis('off')
For the bi/trigram groups, topics 1 and 4 carry negative sentiment, which may relate to service, ambiance, food, and restaurant location.
# Instantiate the clustering model and visualizer
from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import KElbowVisualizer
model = MiniBatchKMeans()
visualizer = KElbowVisualizer(model, k=(2,12), timings=False, metric="silhouette")
visualizer.fit(idfVector.toarray()) # Fit the data to the visualizer
visualizer.show()
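The elbow search above depends on yellowbrick; the same silhouette sweep can be sketched with plain scikit-learn. In this sketch toy blobs stand in for `idfVector`, and the cluster centers are a made-up assumption:

```python
# Sketch: pick k by sweeping MiniBatchKMeans and scoring with silhouette_score.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                  cluster_std=0.5, random_state=42)
scores = {}
for k in range(2, 12):
    labels = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # the four well-separated blobs should peak at k=4
```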
Clustering the TF-IDF data into 4 clusters and reducing the dimensionality to visualize.
from sklearn.decomposition import KernelPCA
# Clustering and reducing the dimension
idfTsneDf = pd.DataFrame(TSNE(n_components=2).fit_transform(idfVector.toarray()))
idfTsneDf['labels'] = clustringBow(idfVector.toarray(), 4)
X_pca_dim = KernelPCA(n_components=2).fit_transform(idfVector.toarray())
pca_df = pd.DataFrame(dict(x = X_pca_dim[:, 0], y = X_pca_dim[:,1], Cluster = idfTsneDf['labels'] ))
fig, ax = plt.subplots(1, 3, figsize=(20, 7), tight_layout=True)
# First subplot - Bar plot
labelCounts = idfTsneDf['labels'].value_counts()
sns.barplot(x=labelCounts.index, y=labelCounts.values, ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of cluster labels for reviews')
ax[0].set_xlabel('Labels')
ax[0].set_ylabel('Number of reviews')
# Second subplot - Scatter plot with t-SNE
sns.scatterplot(data=idfTsneDf, x=0, y=1, hue="labels", ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of words with t-SNE')
ax[1].set_xlabel('Component 1')
ax[1].set_ylabel('Component 2')
# Third subplot - Scatter plot with PCA
sns.scatterplot(x='x', y='y', data=pca_df, hue='Cluster', palette='Set1', ax=ax[2])
ax[2].set_title('Distribution of words with PCA')
ax[2].set_xlabel('Component 1')
ax[2].set_ylabel('Component 2')
Cluster 0 has the highest number of reviews and is the major cluster.
fig, axes = plt.subplots(2,2, figsize=(20,10), sharex=True, sharey=True)
idfTsneDf['words'] = processedReviewData['text_lemmatized']
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
cloudClusterWords = " ".join(" ".join(cat) for cat in idfTsneDf[idfTsneDf['labels']==i]['words'])
plt.gca().imshow(WordCloud(collocations = False, background_color = 'white', colormap='tab20', max_words=50).generate(cloudClusterWords))
plt.gca().set_title('Topic ' + str(i))
plt.gca().axis('off')
Topic 3 contains negative or near-negative words.
Latent Dirichlet Allocation (LDA) is a probabilistic model that treats every topic as a bag of words and every document as a bag of topics, where each topic is drawn from the bag with some probability.
from gensim.models import ldamodel
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(processedReviewData['token_lda'])
corpus_bow = [dictionary.doc2bow(text) for text in processedReviewData['token_lda']]
[[(dictionary[id], freq) for id, freq in cp] for cp in corpus_bow[:1]]
[[('appear', 1),
('beef', 1),
('blow', 1),
('burger', 4),
('crap', 1),
('flavor', 1),
('focus', 1),
('ground', 1),
('kroger', 1),
('lunch', 1),
('meat', 1),
('meh', 1),
('menu', 1),
('patty', 2),
('pile', 1),
('state', 1),
('steam', 1),
('water', 1)]]
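`Dictionary`/`doc2bow` simply map each unique token to an integer id and count occurrences, as the output above shows. A dependency-free sketch of the same idea (the token list here is a made-up example, not the review data):

```python
# Sketch: what Dictionary + doc2bow compute, without gensim.
from collections import Counter

tokens = ["burger", "patty", "burger", "meat", "burger", "patty", "burger"]
id2word = {i: w for i, w in enumerate(sorted(set(tokens)))}   # id -> token
word2id = {w: i for i, w in id2word.items()}                  # token -> id
bow = sorted(Counter(word2id[t] for t in tokens).items())     # (id, count) pairs
print([(id2word[i], n) for i, n in bow])  # [('burger', 4), ('meat', 1), ('patty', 2)]
```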
ldaTopicNum = 15
lda_model = ldamodel.LdaModel(corpus=corpus_bow, # Stream of document vectors or sparse matrix of shape (num_documents, num_terms)
id2word=dictionary, # Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
num_topics=ldaTopicNum, # The number of requested latent topics to be extracted from the training corpus.
passes=10, #Number of passes through the corpus during training
per_word_topics=True) # computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length
import warnings
warnings.filterwarnings('ignore')
def worCloudPertopic(model, numTopic, row, col):
cloudLda = WordCloud(stopwords=stop_words,
background_color='white',
max_words=100,
colormap='tab10')
fig, axes = plt.subplots(row,col, figsize=(20,15), sharex=True, sharey=True)
fig.tight_layout()
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
topic_words = " ".join(x[0] for x in model.show_topics(numTopic,formatted=False, num_words=30)[i][1])
cloudLda.generate(topic_words)
plt.gca().imshow(cloudLda)
plt.gca().set_title('Topic ' + str(i))
plt.gca().axis('off')
plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
worCloudPertopic(lda_model, ldaTopicNum, 5, 3)
for i in range(0,ldaTopicNum):
print("Topic : ", i)
print(" ".join(x[0] for x in lda_model.show_topics(ldaTopicNum,formatted=False, num_words=10)[i][1]))
Topic :  0
flower okra surgery steakhouse designer penny bone label tahoe watermelon
Topic :  1
nail salon pedicure job gel color manicure look polish come
Topic :  2
order fry cheese burger chicken sandwich sauce try like potato
Topic :  3
store dress love wedding class selection shop buy staff room
Topic :  4
park parking museum tree lot sunset winery game gras drive
Topic :  5
food service place come drink order wait table time restaurant
Topic :  6
ice cream dog bagel chocolate place staff love shop dr
Topic :  7
sushi roll breakfast coffee egg santa brunch donut barbara salmon
Topic :  8
pizza food place order try service love lunch time restaurant
Topic :  9
car service work day time need tell customer come guy
Topic :  10
sauce pork chicken shrimp bbq dish salad rib fish try
Topic :  11
food like place beer time wine meat try love taste
Topic :  12
room hotel stay staff lot area tour location airport walk
Topic :  13
time ask tell like come place want look order service
Topic :  14
place taco like food eat salsa try love chip mexican
from sklearn.decomposition import LatentDirichletAllocation
feature_names = vectorizer.get_feature_names_out()
lda = LatentDirichletAllocation(
n_components=ldaTopicNum,
max_iter=20
)
lda.fit_transform(idfVector.toarray())
array([[0.39725817, 0.02822674, 0.02822674, ..., 0.02822674, 0.23579419,
0.02822674],
[0.02452442, 0.02452443, 0.02452465, ..., 0.02452443, 0.02452444,
0.02452442],
[0.02456825, 0.02456825, 0.02456825, ..., 0.02456825, 0.02456825,
0.02456825],
...,
[0.02487234, 0.02487233, 0.20476788, ..., 0.02487235, 0.02487233,
0.02487233],
[0.01837954, 0.01837954, 0.01837953, ..., 0.22469078, 0.01837953,
0.01837954],
[0.02273261, 0.02273253, 0.02273251, ..., 0.02273251, 0.02273253,
0.02273252]])
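Each row of the array above is a document's distribution over the 15 topics, so every row sums to 1. A sketch of the same property on a toy corpus (the documents and `n_components=2` are assumptions for illustration):

```python
# Sketch: sklearn LDA returns a per-document topic distribution.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["burger fry cheese", "hotel room stay", "burger cheese lunch", "room staff hotel"]
X = CountVectorizer().fit_transform(docs)
doc_topic = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
print(doc_topic.shape, doc_topic.sum(axis=1))  # (4, 2); each row sums to 1
```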
def plot_top_words(model, feature_names, n_top_words, title, row, col):
# Adapted from the scikit-learn topic-extraction example
fig, axes = plt.subplots(row,col, figsize=(20, 20))
fig.tight_layout()
axes = axes.flatten()
for topic_idx, topic in enumerate(model.components_):
top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
top_features = [feature_names[i] for i in top_features_ind]
weights = topic[top_features_ind]
ax = axes[topic_idx]
ax.barh(top_features, weights, height=0.6, color="#7451eb")
ax.set_title(f'Topic {topic_idx +1}',
fontdict={'fontsize': 13})
ax.invert_yaxis()
ax.tick_params(axis='both', which='major', labelsize=16)
for i in 'top right left'.split():
ax.spines[i].set_visible(False)
fig.suptitle(title, fontsize=15)
ax.tick_params(bottom=False)
ax.set(xticklabels=[])
plt.subplots_adjust(top=0.93, bottom=0.02, wspace=0.6, hspace=0.14)
plt.show()
plot_top_words(lda, feature_names, 10,'Topics in LDA', 5,3)
def display_topics(model, feature_names, no_top_words):
for topic_idx, topic in enumerate(model.components_):
print("Topic {}:".format(topic_idx))
print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
display_topics(lda, feature_names, 10)
Topic 0:
small easy second good great lucky solid willing particular unprofessional
Topic 1:
large local good incredible great poor professional available affordable classic
Topic 2:
bad special extra good personal courteous basic possible new nice
Topic 3:
hot great good live entire comfortable little true satisfied traditional
Topic 4:
great good little fresh high impressed flat big organic numerous
Topic 5:
busy free french open good soft difficult tiny pleased ridiculous
Topic 6:
low single good chinese normal negative main strong important great
Topic 7:
fantastic hard knowledgeable good able great fabulous vegetarian exceptional real
Topic 8:
happy new black healthy great good little gross natural nice
Topic 9:
reasonable old expensive good great white typical quiet casual nice
Topic 10:
ready regular average good original safe tough bad additional enjoyable
Topic 11:
disappointed wrong attentive good outstanding green complete major great personable
Topic 12:
nice terrible short good huge authentic great big similar dead
Topic 13:
different italian good social specific necessary small new broad upper
Topic 14:
delicious horrible great good overall fresh red daily little bad
nfmTopicNumber = 15
from gensim.models import Nmf
nfm_model = Nmf(corpus=corpus_bow, # Stream of document vectors or sparse matrix of shape (num_documents, num_terms)
id2word=dictionary, # Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
num_topics=nfmTopicNumber, # The number of requested latent topics to be extracted from the training corpus.
passes=10) #Number of passes through the corpus during training
worCloudPertopic(nfm_model, nfmTopicNumber, 5, 3)
for i in range(0,nfmTopicNumber):
print("Topic : ", i)
print(" ".join(x[0] for x in nfm_model.show_topics(nfmTopicNumber,formatted=False, num_words=10)[i][1]))
Topic :  0
pizza experience recommend visit staff love year restaurant work price
Topic :  1
order chicken fry sauce come sandwich cheese dish eat salad
Topic :  2
hair appointment ask salon love cut animal time zoo leave
Topic :  3
know like taco fish want review price thing work business
Topic :  4
look like dress try feel nail walk way decide seat
Topic :  5
wait service minute ask order customer sit bar restaurant seat
Topic :  6
car work tell day alignment drive need hour pay pm
Topic :  7
ice cream like flavor roll chocolate love cake try sauce
Topic :  8
beer bar try selection time night drink cheese burger menu
Topic :  9
service time customer come try place food location star receive
Topic :  10
food people like taste chicken day love location review thing
Topic :  11
room hotel stay time area people breakfast work lot staff
Topic :  12
food table restaurant drink eat menu dinner come server meal
Topic :  13
place love burger try recommend sandwich drink order want price
Topic :  14
come tell day ask time want work check phone pay
from sklearn.decomposition import NMF
nmf = NMF(n_components=nfmTopicNumber , max_iter=20)
nmf.fit_transform(idfVector.toarray())
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[3.24991929e-04, 4.19272097e-04, 4.06676911e-03, ...,
1.44567139e-02, 0.00000000e+00, 1.10224114e-02],
[5.75158917e-04, 1.19162748e-03, 0.00000000e+00, ...,
0.00000000e+00, 1.75829363e-01, 0.00000000e+00],
...,
[1.09736133e-04, 0.00000000e+00, 2.37331655e-04, ...,
1.56027879e-01, 0.00000000e+00, 2.43606229e-03],
[1.34688251e-03, 4.15642859e-02, 5.60640331e-02, ...,
0.00000000e+00, 5.24868343e-04, 3.46820596e-04],
[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
1.40661755e-01, 1.09685396e-03, 0.00000000e+00]])
plot_top_words(nmf, feature_names, 10,'Topics in NMF', 5,3)
display_topics(nmf, feature_names, 10)
Topic 0:
good overall disappointed attentive french second local expensive old open
Topic 1:
great reasonable overall local easy big knowledgeable huge old high
Topic 2:
nice special overall big free local open reasonable black able
Topic 3:
little special able tiny average high red free hard big
Topic 4:
bad horrible terrible old open wrong poor entire hard low
Topic 5:
fresh huge local old big healthy high soft able available
Topic 6:
new old ready open free professional big special hard local
Topic 7:
small high open local big easy regular overall hard authentic
Topic 8:
hot huge special big ready extra easy chinese main regular
Topic 9:
happy old able reasonable attentive outstanding big special free disappointed
Topic 10:
different big second disappointed huge able high terrible extra available
Topic 11:
delicious special extra healthy free authentic attentive impressed green chinese
Topic 12:
busy big attentive ready free horrible old able real poor
Topic 13:
fantastic old big easy personable knowledgeable free short able high
Topic 14:
large huge open local special high able entire outstanding comfortable
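NMF factorizes the TF-IDF matrix X into two non-negative factors, X ≈ W·H, where W holds document-topic weights and H holds topic-word weights. A minimal sketch, with random non-negative data standing in for `idfVector`:

```python
# Sketch: non-negative matrix factorization, X (20x30) ~= W (20x5) @ H (5x30).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((20, 30))                      # stand-in for the TF-IDF matrix
model = NMF(n_components=5, max_iter=500, random_state=0)
W = model.fit_transform(X)                    # document-topic weights
H = model.components_                         # topic-word weights
print(W.shape, H.shape)  # (20, 5) (5, 30)
```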
processedReviewData['ldaGemsimTopics'] = processedReviewData['token_lda'].apply(lambda x : lda_model.get_document_topics(dictionary.doc2bow(x), minimum_probability=0.1))
processedReviewData['clusterLabels'] = idfTsneDf['labels']
processedReviewData['nfmGemsimTopics'] = processedReviewData['token_lda'].apply(lambda x : nfm_model.get_document_topics(dictionary.doc2bow(x), minimum_probability=0.1))
processedReviewData[['text','ldaGemsimTopics', 'clusterLabels', 'nfmGemsimTopics']]
| | text | ldaGemsimTopics | clusterLabels | nfmGemsimTopics |
|---|---|---|---|---|
| 0 | Went for lunch and found that my burger was me... | [(2, 0.2765501), (8, 0.12674528), (11, 0.20354... | 0 | [(1, 0.23430688282735607), (7, 0.1013745352596... |
| 1 | I needed a new tires for my wife's car. They h... | [(9, 0.72247756), (13, 0.21561399)] | 0 | [(1, 0.30649012717452373), (6, 0.6057575442527... |
| 2 | Jim Woltman who works at Goleta Honda is 5 sta... | [(1, 0.9377559)] | 0 | [(0, 0.3981311405112851), (3, 0.13720122666155... |
| 3 | Been here a few times to get some shrimp. The... | [(11, 0.21335638), (14, 0.690344)] | 2 | [(3, 0.2126430775025083), (8, 0.16414513683691... |
| 4 | This is one fantastic place to eat whether you... | [(8, 0.23385039), (11, 0.67947745)] | 3 | [(0, 0.21412922148957647), (13, 0.675145364135... |
| ... | ... | ... | ... | ... |
| 4994 | I am not sure what to think of this place. I b... | [(8, 0.11680629), (9, 0.16956975), (13, 0.6777... | 0 | [(2, 0.1954964385554125), (9, 0.22997305655587... |
| 4995 | I'm so excited to see the red Robin had re-ope... | [(2, 0.21591565), (5, 0.7064889)] | 0 | [(1, 0.12841763016219607), (5, 0.5821628352784... |
| 4996 | This is our go-to pizza place! We love their ... | [(8, 0.6359706), (13, 0.25112677)] | 0 | [(0, 0.6627956410121099), (10, 0.1427711627113... |
| 4997 | This is located in a great spot fairly close t... | [(4, 0.123830244), (7, 0.20269445), (12, 0.312... | 0 | [(3, 0.20750610054368798), (4, 0.1515471170643... |
| 4998 | I went in for a sirloin burger and a salad. Th... | [(2, 0.2131707), (3, 0.12926008), (11, 0.20708... | 0 | [(1, 0.2092417280021048), (8, 0.18271243524895... |
4999 rows × 4 columns
df_ldaVis = pd.DataFrame([val for sublist in processedReviewData['ldaGemsimTopics'] for val in sublist])
df_nfmVis = pd.DataFrame([val for sublist in processedReviewData['nfmGemsimTopics'] for val in sublist])
fig, ax = plt.subplots(1, 3, figsize=(20,7), tight_layout=True)
sns.kdeplot(data = df_ldaVis,x=1, hue=0, ax=ax[0], palette='Set1')
ax[0].set_title('Distribution of topic proportions across reviews (LDA)')
ax[0].set_xlabel('Percentage of Topic present')
ax[0].set_ylabel('Reviews')
sns.kdeplot(data = processedReviewData, x='clusterLabels', ax=ax[1], palette='Set1')
ax[1].set_title('Distribution of Clusters over reviews')
# ax[1].set_xlabel('Labels')
# ax[1].set_ylabel('Component 2')
sns.kdeplot(data = df_nfmVis,x=1, hue=0, ax=ax[2], palette="Set1")
ax[2].set_title('Distribution of topic proportions across reviews (NMF)')
ax[2].set_xlabel('Percentage of Topic present')
ax[2].set_ylabel('Reviews')
K-means clustering: based on the word representation of each cluster, we can label the groups by summarizing the sense of the words present in the respective clusters.
| Topics | LDA topics (Gensim model) | NMF topics (Gensim model) |
|---|---|---|
| Topic 0 | food order | car service |
| Topic 1 | car, time and day, pay | Animal-friendly place |
| Topic 2 | food order wait time | Fast food |
| Topic 3 | Ambiance, surrounding | Delivery |
| Topic 4 | cream ice place | Nice Interior |
| Topic 5 | sushi, tour guide | Bar |
| Topic 6 | food service | Services |
| Topic 7 | Mexican food | Salon |
| Topic 8 | delivery | City tour |
| Topic 9 | Hotel | Restaurant |
| Topic 10 | Polish | Fast food |
| Topic 11 | Customer service | ice cream, Job |
| Topic 12 | Shopping place | service |
| Topic 13 | Art | hotel, resort |
| Topic 14 | Salon | pizza store |